Using R to solve some basic questions and to understand important aspects
Let’s start
Download the R code (it is briefly commented)
You will find XXX where I will ask you to code a bit
Part of the code after each XXX depends on successfully filling in those lines of code
Generalized Linear Models
GLMs
Often, linearity between a dependent variable and a set of predictors is not a suitable assumption. Normal linear models rely on a normality assumption, and this may be a limitation when the dependent variable is discrete, e.g. with support \(\{0,1\}\) or \(\{0,1,2, \ldots \}\), or when it is continuous with range \((0, +\infty)\). In normal linear models, the main assumption is (with \(x_{i1} = 1\) for the intercept): \[ E[Y_i|{\bf x}_i]= \eta_i = \sum_{j=1}^{p}\beta_jx_{ij}.\] Instead, the generalized linear model considers modelling a suitable transformation of the conditional expectation as a linear function of the covariates, for a broad range of distributions, that is (with \(x_{i1} = 1\) for the intercept) \[g(E[Y_i|{\bf x}_i])= \eta_i =\sum_{j=1}^{p}\beta_jx_{ij}.\] The main ingredients of a GLM are: the distribution of \(Y_i\), which belongs to the exponential dispersion family; the linear predictor \(\eta_i\); and the link function \(g(\cdot)\), which must be twice differentiable.
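As a quick numerical illustration of these ingredients (this sketch is not part of the original slide code), for the binomial family the canonical link is the logit, and the link and inverse link map back and forth between the mean scale and the linear-predictor scale:

```r
# Sketch (illustrative, not from the slides): the link maps the mean scale
# to the linear-predictor scale; for the binomial family the canonical
# link is the logit.
g    <- function(mu) log(mu / (1 - mu))    # logit link
ginv <- function(eta) 1 / (1 + exp(-eta))  # inverse link (logistic function)
eta <- 0.25               # an arbitrary value of the linear predictor
mu  <- ginv(eta)          # the corresponding mean, in (0, 1)
stopifnot(abs(g(mu) - eta) < 1e-12)  # g and ginv are inverses of each other
```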
Logistic regression
Example
The following example is from Chapter 5 of Data analysis using regression and multilevel/hierarchical models, Gelman, A., & Hill, J. (2007), Cambridge University Press.
It is known that water in some wells in Bangladesh and other South Asian countries might be contaminated with natural arsenic. The effect of the arsenic ingestion on health is a cumulative poisoning, and the risk of cancer and other diseases is estimated to be proportional to the dose intake.
A research team from the United States and Bangladesh examined the water of several wells in the Araihazar district of Bangladesh and measured the arsenic level, classifying a well as safe if the arsenic level was below 0.5 (in units of hundreds of micrograms per liter) and unsafe if it was above 0.5. All the people using unsafe wells were encouraged to switch to a nearby well. Indeed, contamination does not depend on proximity, and nearby wells can have very different levels of arsenic.
Logistic regression
Example
The dataset in the file Wells.csv contains information on whether or not 3020 households switched from unsafe wells a few years after the arsenic investigation. The variables are
switch: whether or not the household switched to another well from an unsafe well: no (0) or yes (1);
arsenic: the level of arsenic contamination in the household’s original well (in hundreds of micrograms per liter); note that all are above 0.5, which was the level identified as “unsafe”;
distance: distance to the closest known safe well (in meters);
education: education of the head of the household (in years);
association: whether or not any members of the household participated in any community organizations: no (0) or yes (1).
Goal: using a logistic regression, we aim to determine the main factors of well switching among the users of unsafe wells.
We start by reading the data, analysing their structure and exploring some characteristics.
Logistic regression
data <- read.csv2("Wells.csv", header = TRUE)
summary(data)
switch arsenic dist assoc
Min. :0.0000 Min. :0.510 Min. : 0.387 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.820 1st Qu.: 21.117 1st Qu.:0.0000
Median :1.0000 Median :1.300 Median : 36.761 Median :0.0000
Mean :0.5752 Mean :1.657 Mean : 48.332 Mean :0.4228
3rd Qu.:1.0000 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.:1.0000
Max. :1.0000 Max. :9.650 Max. :339.531 Max. :1.0000
educ
Min. : 0.000
1st Qu.: 0.000
Median : 5.000
Mean : 4.828
3rd Qu.: 8.000
Max. :17.000
Logistic regression
par(mfrow = c(1, 2))
boxplot(dist ~ switch, data = data, horizontal = TRUE,
        xlab = "Distance (in meters)", ylab = "Switch")
boxplot(arsenic ~ switch, data = data, horizontal = TRUE,
        xlab = "Arsenic (in hundreds of mg per liter)", ylab = "Switch")
We start by modelling the effect of the variable distance on the variable switch. Since the outcome variable is a binary variable, the model chosen is the logistic regression model. \[ \rm{Pr}(Y_i=1| dist_i)=E(Y_i | dist_i)=p_i \quad \quad \text{logit}(p_i)= \eta_i = \beta_1 +\beta_2 dist_i\] where \(\text{logit}(x)=\log(x/(1-x))\) is a function that maps the range \((0,1)\) to the range \((-\infty,\infty)\).
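The summary below is produced by fitting the model with glm(); this chunk is a sketch reconstructed from the Call shown in the output (the object name fit1 is taken from the later predict() call, and the data frame data is the one read above):

```r
# sketch of the fitting step, reconstructed from the Call in the output below
fit1 <- glm(switch ~ dist, data = data, family = binomial(link = "logit"))
summary(fit1)
```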
Call:
glm(formula = switch ~ dist, family = binomial(link = "logit"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6059594 0.0603102 10.047 < 2e-16 ***
dist -0.0062188 0.0009743 -6.383 1.74e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4118.1 on 3019 degrees of freedom
Residual deviance: 4076.2 on 3018 degrees of freedom
AIC: 4080.2
Number of Fisher Scoring iterations: 4
Logistic regression: Single predictor
Interpretation
The intercept (\(\beta_1\)) can only be interpreted assuming zero values for the other predictors. Here, when dist\(=0\), the estimated probability of switching for a household that lives right next to a safe well is about \(65\%\), given by \[\hat P(Y_i = 1 | dist_i = 0) = \frac{e^{\hat \beta_1}}{1+e^{\hat \beta_1}}= \frac{1}{1+e^{-\hat \beta_1}}\]
predict(fit1, newdata = data.frame(dist = 0), type = "response")
1
0.6470185
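Equivalently, this value can be computed by hand from the reported estimate (a sketch, with the coefficient typed in from the summary above):

```r
b1 <- 0.6059594           # intercept estimate from the summary above
p0 <- 1 / (1 + exp(-b1))  # inverse logit of the intercept
round(p0, 4)              # about 0.647, as in the predict() output
stopifnot(abs(p0 - 0.6470185) < 1e-5)
```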
Logistic regression: Single predictor
Interpretation
Interpreting the coefficient of the predictor dist on the log-odds scale: the coefficient can be read as a difference of \(-0.0062\) in the estimated logit probability of switching well when the distance increases by one meter, that is \[\text{logit}(P(Y_i=1|dist_i=c+1)) - \text{logit}(P(Y_i=1|dist_i=c)) = \beta_2.\]
From the previous expression, we can obtain the log-odds ratio \[\log\Bigg(\frac{P(Y_i=1|dist_i=c+1)/P(Y_i=0|dist_i=c+1)}{P(Y_i=1|dist_i=c)/P(Y_i=0|dist_i=c)}\Bigg)=\beta_2\] and so \[\frac{P(Y_i=1|dist_i=c+1)}{P(Y_i=0|dist_i=c+1)} = \exp(\beta_2) \frac{P(Y_i=1|dist_i=c)}{P(Y_i=0|dist_i=c)}.\] The ratio between the probability of success (here, switching well) and of failure (that is, the odds) when \(dist=c+1\) is estimated to be \(\exp(\hat \beta_2) = 0.99\) times the odds when \(dist=c\).
exp(coef(fit1)[2])
dist
0.9938005
Logistic regression: Single predictor
Interpretation
Interpreting the coefficient of the predictor dist on the probability scale: we can evaluate the difference of the logistic regression function when adding 1 to the predictor dist. This difference corresponds to a difference on the probability scale, but it requires the choice of a specific value of the predictor, that is \[P(Y_i = 1 | dist_i = c+1) - P(Y_i = 1 | dist_i = c) = \frac{1}{1+\exp(-(\beta_1 + (c+1) \times\beta_2))} - \frac{1}{1+\exp(-(\beta_1 + c \times\beta_2))}\] The mean of the predictor is a good starting point, since it corresponds to the steepest part of the inverse logit function. Thus, adding 1 to dist (that is, adding 1 meter to the distance to the nearest safe well) corresponds to a negative difference of about \(0.15\%\) in the estimated probability of switching.
Note that the coefficients do not have a linear effect on the probability that the outcome is 1, because of the non-linearity of the model. The curvature of the logistic function requires us to choose where to evaluate changes if we want an interpretation on the probability scale. Below, the difference in the estimated probability of switching well due to a one-unit increase of the predictor from its 99th percentile.
The effect of a one-unit increase of dist on the inverse logit scale seems low, but this is misleading. Indeed, distance is measured in meters, so this coefficient corresponds to the difference between, say, a house that is 48 meters away from the nearest safe well and a house that is 49 meters away. To improve the interpretability of the results, we rescale distance in 100-meter units. Adding 100 meters to the distance to the nearest safe well then corresponds to a negative difference in the probability of switching of about \(15\%\).
Logistic regression: Single predictor
Interpretation
Note that \(\hat \beta_2\) in the new model is 100 times the \(\hat \beta_2\) of the original one
Also note the following equivalence: the difference on the probability (of switching well) scale for a unit increase of dist100 equals the difference for a 100-unit increase of dist in the original model, as the two identical outputs below show
Call:
glm(formula = switch ~ dist100, family = binomial(link = "logit"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.60596 0.06031 10.047 < 2e-16 ***
dist100 -0.62188 0.09743 -6.383 1.74e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4118.1 on 3019 degrees of freedom
Residual deviance: 4076.2 on 3018 degrees of freedom
AIC: 4080.2
Number of Fisher Scoring iterations: 4
[1] -0.1542287
[1] -0.1542287
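The value printed above can be reproduced by hand; a sketch using the dist100 estimates from the summary and the sample mean of dist100 (0.48332, i.e. the mean of dist divided by 100, from the data summary):

```r
invlogit <- function(x) 1 / (1 + exp(-x))
b <- c(0.60596, -0.62188)  # intercept and dist100 estimates from the summary
m <- 0.48332               # sample mean of dist100
diff_prob <- invlogit(b[1] + b[2] * (m + 1)) - invlogit(b[1] + b[2] * m)
diff_prob                  # about -0.154, as printed above
stopifnot(abs(diff_prob + 0.1542287) < 1e-4)
```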
Logistic regression: Single predictor
Interpretation
The probability of switching well is estimated to be about \(60\%\) for those who live close to a safe well, declining to about \(20\%\) when the distance from a safe well increases up to \(300\) meters from their home. This seems reasonable: the probability of switching is higher for people who live closer to a safe well (Gelman and Hill, 2007).
We now consider adding the arsenic level to our model. We expect that the higher the arsenic concentration, the higher the probability of switching well; translated into model terms, a positive sign for the arsenic coefficient. We plot the histogram of the arsenic concentration before fitting the model and analysing the results.
We then fit the model \[\text{logit}(P(Y_i = 1|dist^{*}_i, arsenic_i))= \eta_i = \beta_1 +\beta_2 \texttt{dist}^{*}_i + \beta_3 \texttt{arsenic}_i\]
(note \(dist^* = dist/100\))
fit3 <- glm(switch ~ dist100 + arsenic, data = data, family = binomial("logit"))
summary(fit3)
Call:
glm(formula = switch ~ dist100 + arsenic, family = binomial("logit"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.002749 0.079448 0.035 0.972
dist100 -0.896644 0.104347 -8.593 <2e-16 ***
arsenic 0.460775 0.041385 11.134 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4118.1 on 3019 degrees of freedom
Residual deviance: 3930.7 on 3017 degrees of freedom
AIC: 3936.7
Number of Fisher Scoring iterations: 4
Logistic regression: Adding predictors
Interpretation
Given two wells with the same arsenic level (notice that it cannot be 0!), the estimated logit probability of switching decreases by \(0.9\) for every additional 100 meters in distance to the nearest safe well. \[\text{logit}(P(Y_i=1|dist^*_i=c+1, arsenic_i = k)) - \text{logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k)) = \beta_2\]
For two equally distant nearest safe wells, a difference of 1 in arsenic concentration corresponds to a positive difference of 0.46 in the logit probability of switching. \[\text{logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k+1)) - \text{logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k)) = \beta_3\]
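On the odds scale, these two differences correspond to multiplicative effects; a sketch with the estimates typed in from the summary above:

```r
b <- c(0.002749, -0.896644, 0.460775)  # fit3 estimates from the summary above
or_dist    <- exp(b[2])  # odds ratio for +100 m of distance, arsenic fixed
or_arsenic <- exp(b[3])  # odds ratio for +1 unit of arsenic, distance fixed
c(or_dist, or_arsenic)   # about 0.41 and 1.59
stopifnot(abs(or_dist - 0.408) < 1e-3, abs(or_arsenic - 1.585) < 1e-3)
```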
Logistic regression: Adding predictors
Note
The following plots show the fitted model with two predictors. Numerical interpretation of the effect of one predictor on the probability scale needs to be done by choosing a value for the other predictor.
Left: probability of switching well vs distance when arsenic is equal to 0.5 and 1.5
Right: probability of switching well vs arsenic when the distance is equal to 0 and 50
par(mfrow = c(1, 2))
with(data, plot(dist100, switch.jitter, xlab = "Distance",
                ylab = "Pr(Switching)", type = "n"))
curve(InvLogit(coef(fit3)[1] + coef(fit3)[2] * x + coef(fit3)[3] * 1.5),
      add = TRUE, col = "blue")
curve(InvLogit(coef(fit3)[1] + coef(fit3)[2] * x + coef(fit3)[3] * 0.5),
      add = TRUE, col = "red")
points(data$dist100, switch.jitter, pch = 20, cex = 0.1)
text(0.50, 0.27, "if As = 0.5", adj = 0, cex = 0.8, col = "red")
text(0.75, 0.50, "if As = 1.5", adj = 0, cex = 0.8, col = "blue")
with(data, plot(arsenic, switch.jitter, xlab = "Arsenic",
                ylab = "Pr(Switching)", type = "n"))
curve(InvLogit(coef(fit3)[1] + coef(fit3)[3] * x), add = TRUE, col = "red")
text(1.5, 0.8, "if dist = 0", col = "red")
curve(InvLogit(coef(fit3)[1] + coef(fit3)[2] * 0.5 + coef(fit3)[3] * x),
      add = TRUE, col = "blue")
text(4, 0.6, "if dist = 50", col = "blue")
points(data$arsenic, switch.jitter, cex = 0.01, pch = 20)
Logistic regression: Adding predictors
Interpretation
Note
In general terms, the fitted model can be summarized as follows: switching is easier if there is a nearby safe well, and households whose existing well has a high arsenic level have more motivation to switch.
Logistic regression: Including interaction term
Interactions
We now want to verify whether there is a statistically significant effect of the interaction between distance and arsenic concentration on the probability of switching well.
fit4 <- glm(switch ~ dist100 * arsenic, data = data, family = binomial(link = "logit"))
pred4 <- predict(fit4, type = "response")
summary(fit4)
Call:
glm(formula = switch ~ dist100 * arsenic, family = binomial(link = "logit"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.14787 0.11754 -1.258 0.20838
dist100 -0.57722 0.20918 -2.759 0.00579 **
arsenic 0.55598 0.06932 8.021 1.05e-15 ***
dist100:arsenic -0.17891 0.10233 -1.748 0.08040 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4118.1 on 3019 degrees of freedom
Residual deviance: 3927.6 on 3016 degrees of freedom
AIC: 3935.6
Number of Fisher Scoring iterations: 4
Logistic regression: Including interaction term
Interactions
Fixing the predictors at their means, the estimated probability of switching well at the average distance and average arsenic concentration is about 0.59. \[\hat P(Y_i = 1|dist^*_i = \overline{dist^*}, arsenic_i = \overline{arsenic}) = \frac{\exp(\hat \beta_1 + \hat \beta_2 \overline{dist^*} + \hat \beta_3 \overline{arsenic} + \hat \beta_4 \overline{dist^*} \times \overline{arsenic})}{1+\exp(\hat \beta_1 + \hat \beta_2 \overline{dist^*} + \hat \beta_3 \overline{arsenic} + \hat \beta_4 \overline{dist^*} \times \overline{arsenic})}\]
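A sketch of this computation with the fit4 estimates typed in from the summary and the sample means from the data summary (dist100 about 0.48332, arsenic about 1.657):

```r
invlogit <- function(x) 1 / (1 + exp(-x))
b <- c(-0.14787, -0.57722, 0.55598, -0.17891)  # fit4 estimates from above
m_d <- 0.48332  # sample mean of dist100
m_a <- 1.657    # sample mean of arsenic
p_mean <- invlogit(b[1] + b[2] * m_d + b[3] * m_a + b[4] * m_d * m_a)
p_mean          # about 0.59
stopifnot(abs(p_mean - 0.587) < 1e-3)
```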
At the mean level of distance in the data, each additional unit of arsenic corresponds to an approximate 11% positive difference in probability of switching (the mathematical expression is similar to the expression above)
The interaction term can be seen, on the logit scale, as the effect (\(-0.18\)) added to the coefficient of distance for each additional unit of arsenic; indeed
\[{\rm logit}(P(Y_i=1|dist^*_i, arsenic_i = k + 1)) = \beta_1 + \beta_2 dist^*_i + \beta_3 (k + 1) + \beta_4 dist^*_i (k + 1) = \]\[ = \beta_1 + (\beta_2 + \beta_4) dist^*_i + \beta_3 (k + 1) + \beta_4 dist^*_i \times k \] So, the importance of distance as a predictor increases for households with higher existing arsenic levels (\(\hat \beta_2 + \hat \beta_4 = -0.58 - 0.18\))
Further, \[ {\rm logit}(P(Y_i=1|dist^*_i, arsenic_i = k + 1)) - {\rm logit}(P(Y_i=1|dist^*_i, arsenic_i = k)) = \beta_3 + \beta_4 dist^{*}_i\] So, at the average distance, the difference in estimated logit probabilities is about 0.47
coef(fit4)[3] + coef(fit4)[4] * mean(data$dist100)
arsenic
0.4695081
Logistic regression: Including interaction term
Interactions
The interaction term can also be seen, on the logit scale, as the effect (\(-0.18\)) added to the coefficient of arsenic for each additional 100 meters of distance; indeed
\[{\rm logit}(P(Y_i=1|dist^*_i=c, arsenic_i)) = \beta_1 + \beta_2 c + \beta_3 arsenic_i + \beta_4 c \times arsenic_i\]
\[{\rm logit}(P(Y_i=1|dist^*_i = c+1, arsenic_i)) = \beta_1 + \beta_2 (c+1) + \beta_3 arsenic_i + \beta_4 (c+1) arsenic_i = \]\[ = \beta_1 + \beta_2(c+1) + (\beta_3 + \beta_4) arsenic_i + \beta_4 c \times arsenic_i \] So, the importance of arsenic as a predictor decreases for households that are farther from the existing safe wells (\(\hat \beta_3 + \hat \beta_4 = 0.56 - 0.18\))
Further, \[ {\rm logit}(P(Y_i=1|dist^*_i=c+1, arsenic_i = k)) - {\rm logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k)) = \beta_2 + \beta_4 \times k\] So, at the average value of arsenic, the difference in estimated logit probabilities is about \(-0.87\).
coef(fit4)[2] + coef(fit4)[4] * mean(data$arsenic)
dist100
-0.8736527
Logistic regression: Including interaction term
Interactions
The following plots show that the differences in switching associated with differences in arsenic level are larger if you are close to a safe well, but with a diminishing effect if you are far from any safe well. Comparing with the same plot without interaction, note that the two curves mainly differ for higher values of distance.
par(mfrow = c(1, 2))
# Pr(Switching) vs Distance (consider Arsenic = 0.5 and Arsenic = 1.5)
with(data, plot(data$dist, switch.jitter, type = "n",
                xlab = "Distance", ylab = "Pr(Switching)"))
curve(InvLogit(cbind(1, x/100, 1.5, 1.5 * x/100) %*% coef(fit4)), add = TRUE, col = "red")
curve(InvLogit(cbind(1, x/100, 0.5, 0.5 * x/100) %*% coef(fit4)), add = TRUE, col = "blue")
text(100, 0.6, "if As = 1.5", col = "red"); text(35, 0.4, "if As = 0.5", col = "blue")
points(data$dist, switch.jitter, pch = 20, cex = 0.01)
# Pr(Switching) vs Arsenic (consider Distance = 0 and Distance = 50)
with(data, plot(data$arsenic, switch.jitter, type = "n",
                xlab = "Arsenic concentration", ylab = "Pr(Switching)"))
curve(InvLogit(cbind(1, 0, x, 0) %*% coef(fit4)), add = TRUE, col = "red")
curve(InvLogit(cbind(1, 0.5, x, 0.5 * x) %*% coef(fit4)), add = TRUE, col = "blue")
text(2, 0.9, "if Dist = 0", col = "red"); text(3, 0.6, "if Dist = 50", col = "blue")
points(data$arsenic, switch.jitter, pch = 20, cex = 0.01)
Logistic regression: Including interaction term
Interactions
Logistic regression: Model Comparison and Model Selection
Model Comparison
More precisely, let \(\boldsymbol \beta = (\boldsymbol \beta^\top_0, \boldsymbol \beta^\top_1)^\top\), where \(\boldsymbol \beta_0=(\beta_1, \ldots, \beta_{p_0})\) and \(\boldsymbol \beta_1=(\beta_{p_0 + 1}, \beta_{p_0 + 2}, \ldots, \beta_{p})\). We want to test \[
\begin{cases}
H_{0}: & \beta_{p_0 + 1} = \beta_{p_0 + 2} = \ldots = \beta_{p} = 0\\
H_{1}: & \exists r \in \{p_0+1, \ldots, p\}: \beta_{r} \neq 0
\end{cases}
\] The LRT statistic based on the deviance difference is asymptotically \(\chi^2_{p-p_0}\) distributed under \(H_0\), and allows a comparison between nested models.
The residual deviance drops significantly when adding the distance and arsenic variables, while only marginal benefits are obtained by considering the interaction effect.
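Using the residual deviances reported in the summaries above, the deviance-based tests can be sketched by hand (values typed in from the outputs):

```r
# fit1 (switch ~ dist100):  residual deviance 4076.2 on 3018 df
# fit3 (+ arsenic):         residual deviance 3930.7 on 3017 df
# fit4 (+ interaction):     residual deviance 3927.6 on 3016 df
p_arsenic <- pchisq(4076.2 - 3930.7, df = 1, lower.tail = FALSE)
p_inter   <- pchisq(3930.7 - 3927.6, df = 1, lower.tail = FALSE)
c(p_arsenic, p_inter)  # adding arsenic: essentially 0; interaction: about 0.08
stopifnot(p_arsenic < 1e-10, p_inter > 0.05, p_inter < 0.1)
```

The same comparison can be obtained directly with anova(fit1, fit3, fit4, test = "Chisq").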
Logistic regression: Model Comparison and Model Selection
Model Selection
Model selection can also be carried out using the AIC. It judges a model by how close its fitted values tend to be to the true expected values, as summarized by an expected distance between the two. Smaller AIC is better, and the AIC penalizes a model for having many parameters.
c( AIC(fit1), AIC(fit3), AIC(fit4))
[1] 4080.238 3936.668 3935.628
Note
The conclusions are similar to those reported above.
Exercise Explore the Anova() function of the car package.
Logistic regression: Binned Residuals
Binned Residuals
The outcome variable in logistic regression can take only the values 0 and 1. The plot of the residuals, defined as \[{\rm residual}_i = y_i - \hat {\pi}_i = y_i-\text{logit}^{-1}({\mathbf x}^\top_i\hat \beta)\] versus the fitted values \(\hat \pi_i= \text{logit}^{-1}({\mathbf x}^\top_i\hat \beta)\) — that is, the points \((\hat \pi_i, y_i - \hat \pi_i)\) — is not useful, because the points lie on two parallel lines, each with slope \(-1\). Let's consider the model labelled fit4.
pred4 <- predict(fit4, type = "response")
res4 <- residuals(fit4, type = "response")
plot(c(0, 1), c(-1, 1), xlab = "Estimated Pr(Switching)", type = "n",
     ylab = "Observed - Estimated", main = "Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
points(pred4, res4, pch = 20, cex = 0.2)
abline(c(0, -1), col = "red", lwd = 0.25)
abline(c(1, -1), col = "red", lwd = 0.25)
Logistic regression: Binned Residuals
Binned Residuals
Logistic regression: Binned Residuals
Binned Residuals
A more readable plot can be made using binned residuals: divide the data into categories (bins) based on their fitted values, then plot the average residual versus the average fitted value for each bin (Gelman and Hill, 2007, page 97).
The number of bins has to be chosen so that each bin contains enough points: the resulting plot should not be too noisy (too many bins) but should still be able to highlight patterns in the residuals (hidden if the number of bins is too low). Below is the function for obtaining the binned residuals (with small changes w.r.t. the binned.resids() function of the arm package). The inputs are
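A minimal sketch of what such a function may look like (the slides' version, adapted from arm, may differ in details); for each bin it returns the average value binned on, the average residual, the bin size, the bin range, and a 2-standard-error band:

```r
# sketch of a binned-residuals helper (the slides' version may differ)
binned.resids <- function(x, y, nclass = floor(sqrt(length(x)))) {
  # inputs: x = values to bin on, y = residuals, nclass = number of bins
  breaks <- unique(quantile(x, seq(0, 1, length.out = nclass + 1), na.rm = TRUE))
  bin <- cut(x, breaks, include.lowest = TRUE)
  out <- t(sapply(levels(bin), function(b) {
    xb <- x[bin == b]
    yb <- y[bin == b]
    c(xbar = mean(xb),                       # average x in the bin
      ybar = mean(yb),                       # average residual in the bin
      n    = length(xb),                     # bin size
      x.lo = min(xb),                        # bin range, lower
      x.hi = max(xb),                        # bin range, upper
      se2  = 2 * sd(yb) / sqrt(length(yb)))  # 2 se of the average residual
  }))
  list(binned = out)
}
```

With this sketch, column 1 is the average fitted value, column 2 the average residual and column 6 the 2-se band used in the plots below.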
The following plot shows the binned residuals vs the estimated probability of switching. Grey lines depict \(\pm 2\) standard error (se) bounds, so we expect 95% of the binned residuals to fall between these two lines. The se is defined as \(\sqrt{p(1-p)/n}\), where \(n\) is the number of data points used to compute each average residual.
br4 <- binned.resids(pred4, res4, nclass = 50)$binned
plot(range(br4[, 1]), range(br4[, 2], br4[, 6], -br4[, 6]), type = "n",
     xlab = "Estimated Pr(Switching)", ylab = "Average residual",
     main = "Binned Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
points(br4[, 1], br4[, 2], pch = 19, cex = 0.5)
lines(br4[, 1], br4[, 6], col = "gray", lwd = 0.5)
lines(br4[, 1], -br4[, 6], col = "gray", lwd = 0.5)
Logistic regression: Binned Residuals
Binned Residuals
Logistic regression: Binned Residuals
Binned Residuals
A pattern can be identified by looking at the plot of the binned residuals with respect to the arsenic level. Indeed, there are large negative binned residuals for some households whose wells have small values of arsenic concentration. These points correspond to people who are less likely to switch well than predicted by the model (almost \(-20\%\)). Moreover, the distribution of the residuals shows a slight pattern: the model seems to underestimate the probability of switching for medium arsenic levels, while it overestimates it for high arsenic concentrations. So we plot the binned residuals vs the predictors.
par(mfrow = c(1, 2))
br.dist <- binned.resids(data$dist, res4, nclass = 40)$binned
plot(range(br.dist[, 1]), range(br.dist[, 2], br.dist[, 6], -br.dist[, 6]),
     type = "n", xlab = "Distance", ylab = "Average residual",
     main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.dist[, 1], br.dist[, 6], col = "gray", lwd = 0.5)
lines(br.dist[, 1], -br.dist[, 6], col = "gray", lwd = 0.5)
points(br.dist[, 1], br.dist[, 2], pch = 19, cex = 0.5)
br.arsenic <- binned.resids(data$arsenic, res4, nclass = 40)$binned
plot(range(br.arsenic[, 1]), range(br.arsenic[, 2], br.arsenic[, 6], -br.arsenic[, 6]),
     type = "n", xlab = "Arsenic concentration", ylab = "Average residual",
     main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.arsenic[, 1], br.arsenic[, 6], col = "gray", lwd = 0.5)
lines(br.arsenic[, 1], -br.arsenic[, 6], col = "gray", lwd = 0.5)
points(br.arsenic[, 1], br.arsenic[, 2], pch = 19, cex = 0.5)
Exercise
Propose a possible solution to improve the model and fit your model. Compare the new binned residual plots with the previous ones.
Logistic regression: Binned Residuals
Binned Residuals
Logistic regression: Error rate
Error rate
The error rate is defined as the proportion of cases for which the prediction is wrong, i.e. the fitted probability is above a threshold (commonly 0.5) while the outcome is 0, or the fitted probability is below the threshold while the outcome is 1: \[er = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\{(y_i=0 \, \& \, \hat p_i >0.5)\, \text{ or } \, (y_i=1 \, \& \, \hat p_i <0.5)\}\] It should be a value between 0 (perfect prediction) and 0.5 (random guessing).
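As a sketch, the definition can be written as a small helper (hypothetical, not from the slides), checked on made-up values:

```r
# hypothetical helper implementing the definition above
error_rate <- function(y, p, threshold = 0.5) {
  mean((y == 1 & p < threshold) | (y == 0 & p > threshold))
}
# toy check on made-up outcomes and fitted probabilities
y <- c(1, 1, 0, 0)
p <- c(0.9, 0.3, 0.2, 0.7)
error_rate(y, p)  # 0.5: two of the four cases are guessed wrong
stopifnot(error_rate(y, p) == 0.5)
```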
Usually we expect the error rate to be lower than that of the null model, i.e. the model including only the intercept. The estimated probability for the null model is \(\hat p=\sum_i y_i/n\), the proportion of 1's in the data. The error rate of the null model is \({\rm min}(\hat p, 1-\hat p)\), which corresponds to assigning 1 (if \(\hat p > 0.5\)) or 0 (otherwise) to all the fitted values.
[1] 0.4248344
Logistic regression: Error rate
Error rate
So, if our model improves the prediction over the null model, we expect its error rate to be lower than the null model's. Our last fitted model (fit4), with an error rate of \(0.38\), correctly predicts \(62\%\) of the switching choices, an improvement of only about \(5\%\) over simply guessing that all the households will switch.
[1] 0.3758278
Error rate
The error rate can be considered the analogue of \(R^2\) for the linear model: it can be a useful measure of model fit, but it cannot substitute a deeper evaluation of the model based on coefficient significance and diagnostic plots. Analysing the error rate amounts to extracting information from the confusion matrix. Note that we can vary the threshold 0.5, and a better assessment can be done by analysing the ROC curve.
0 1
FALSE 442 294
TRUE 841 1443
[1] 0.6241722
[1] 0.3758278
Exercise
Compute the error rate of your proposed model and compare it with the previous one.
Poisson Regression
Poisson Regression
Poisson Regression
Assume the response variable represents a count. A sensible choice is to model \(y_i\), \(i=1, \ldots, n\), with a Poisson distribution, which belongs to the exponential dispersion family, so that a GLM can be fitted. The counts have a positive mean, and it is sensible to consider a log link function \[g(\mu_i) = \log(\mu_i)= \sum_{j=1}^{p} x_{ij} \beta_j, \quad i=1, \ldots, n\]
Then, the mean satisfies \[\mu_i = E[Y_i|\mathbf x_{i}] = \exp \bigg(\sum_{j=1}^{p} x_{ij} \beta_j\bigg) \]
The mean of \(Y\) at \(x_{ik}=c+1\) is equal to the mean at \(x_{ik}=c\) times \(\exp(\beta_k)\), adjusting for the other explanatory variables, that is
\[\frac{E[Y_i|x_{ik}=c+1, \ldots]}{E[Y_i|x_{ik}=c, \ldots]} = \exp(\beta_k)\] So, for a unit change in \(x_{ik}\), the expected value of \(Y\) is multiplied by \(e^{\beta_k}\).
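A quick numerical check of this multiplicative interpretation (made-up coefficient values):

```r
b <- c(0.2, 0.35)                       # hypothetical intercept and slope
mu <- function(x) exp(b[1] + b[2] * x)  # Poisson mean as a function of x
ratio <- mu(3) / mu(2)                  # mean at x = 3 vs mean at x = 2
stopifnot(abs(ratio - exp(b[2])) < 1e-12)  # the ratio equals exp(beta)
```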
Poisson Regression: Example
Example: Modelling claims frequency
The dataset SingaporeAuto.txt contains information on the number of accidents for a subset of data collected by the General Insurance Association of Singapore. Also some characteristics of the driver and the car are available. Only the data that refer to automobiles will be considered here.
The purpose of the analysis is to understand the impact of vehicle and driver characteristics on accident experience. These relationships represent the base for actuaries working in rate making, i.e., setting the price of the insurance coverages.
The variables and their description are the following
Female: Gender of Insured (female=1, male = 0)
PC: private vehicle (yes=1, no = 0)
Clm_Count: Number of claims during the year
Exp_weights: the fraction of the year the policy is in effect (an exposure variable)
NCD: No Claim Discount (the higher the discount, the better the prior accident record)
AgeCat: the age of the policyholder grouped into 7 categories, 0–6 (0 = 21 and younger; 1–6 refer respectively to 22–25, 26–35, 36–45, 46–55, 56–65, and 66 and over)
VAgeCat: the age of the vehicle grouped into 7 categories, 0–6 (indicating vehicle ages 0, 1, 2, 3–5, 6–10, 11–15, and 16 and older)
Poisson Regression: Example
Example: Modelling claims frequency
Thus, we read the data and take a look at the first 5 records.
Note that the last 3 variables are actually categorical, and it can be convenient to redefine them as factors. Note also that, for the automobile data, not all the categories are observed.
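A sketch of this step (the file name is from the slides; the plain whitespace separator and the factor recoding are assumptions):

```r
# read the data (separator is an assumption) and inspect the first records
accidents <- read.table("SingaporeAuto.txt", header = TRUE)
head(accidents, 5)
# redefine the grouped variables as factors (assumed recoding)
accidents$NCD     <- factor(accidents$NCD)
accidents$AgeCat  <- factor(accidents$AgeCat)
accidents$VAgeCat <- factor(accidents$VAgeCat)
```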
The dependent variable is the count variable Clm_Count. Let us first take a look at its distribution.
table(accidents$Clm_Count)
0 1 2 3
3555 271 15 1
barplot(table(accidents$Clm_Count))
Poisson Regression: Example
Example: Modelling claims frequency
By comparing the relative frequencies with the probabilities of a Poisson distribution with \(\lambda\) equal to the sample mean (we do not conduct a formal test), it seems sensible to fit a Poisson regression model to these data.
We can now estimate a Poisson regression model including all the available covariates (except Exp_weights, for a reason that will be clarified in a couple of pages).
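The fit just described can be sketched as follows (the model object name fit_pois is an assumption; the accidents data frame with the factor recoding from earlier is assumed to be available):

```r
# sketch of the full Poisson fit (object name is an assumption)
fit_pois <- glm(Clm_Count ~ Female + PC + NCD + AgeCat + VAgeCat,
                data = accidents, family = poisson(link = "log"))
summary(fit_pois)
```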
The estimated female effect \(\hat \beta_2 \approx -0.15\) means that the expected number of claims is lower for females than for males, adjusted for the remaining variables; that is, for fixed levels of the other variables, the estimated expected number of claims for a female equals the estimated expected number of claims for a male times \(\exp(-0.15)=0.86\)
Let us focus on the age (of the policyholder) variables. It seems there are no significant differences among the age categories. However, since age is treated as categorical, the interpretation must be done w.r.t. the reference category, here \(AgeCat = 2\) (26–35 years). The estimates for the age categories suggest that the expected number of claims is higher for older people, adjusting for the remaining covariates. For instance, for given values of the remaining covariates, the expected number of claims for \(AgeCat = 3\) is \(\exp(0.26)= 1.30\) times the expected number of claims for \(AgeCat = 2\), that is
We can analyse the LRT, noticing that the only very small p-value is associated with NCD (with marginal evidence that the subgroup of coefficients associated with VAgeCat is statistically different from zero)
We can clearly see that the models without gender and individual age have a lower AIC than the full model.
Although this is not the case here, there are several applications (see the extra material example) where you may have to deal with overdispersion problems. In such cases, possible solutions include:
considering the negative binomial distribution;
considering the quasi-Poisson model.
Poisson Regression: Example
Example: Modelling claims frequency. Taking into account exposure
The cars considered here may have been insured for only part of the year, and we should take this into account: the exposure differs across cars, and less exposed cars have a lower number of accidents. We should rather model the rate, that is \[ \log(\mu_i/offset_i) = \sum_{j = 1}^{p}x_{ij}\beta_j \quad \text{equivalent to} \quad \log(\mu_i) = \sum_{j = 1}^{p}x_{ij}\beta_j + \log(offset_i)\] where \(\log(offset_i)\) is called the offset. In this specific case, we want to model the rate \(Clm\_Count/Exp\_weights\); setting a log-linear model for this rate is equivalent to introducing the logarithm of the variable Exp_weights into the Poisson regression model as an offset. This can be done directly by writing (note that the parameters do not change much, but take a look at the AIC):
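A sketch of the chunk announced above (object names are assumptions; fit_pois denotes the earlier model without the offset):

```r
# same model with log exposure as an offset (object names are assumptions)
fit_offset <- glm(Clm_Count ~ Female + PC + NCD + AgeCat + VAgeCat,
                  data = accidents, family = poisson(link = "log"),
                  offset = log(Exp_weights))
summary(fit_offset)
AIC(fit_pois, fit_offset)  # compare with the model without the offset
```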